Full CRAM 3.1 support & performance tuning of CRAM read/write#1763
Conversation
|
For testing, hts-specs has a corpus of test files designed to validate CRAM decoder implementations. https://github.com/samtools/hts-specs/tree/master/test/cram The 3.0/README.md covers the basics of the CRAM structure, adding in feature at a time. So blank file with header, then basic one-read bits without reference based encoding, etc. This is mostly testing structures, features, encodings, and so on. By the end of it it's also testing compression codecs. The 3.1 directory has 4 CRAM files with progressive levels of new compression codecs, basically htslib's profiles I think. They're whole file real world examples, as beyond the new codecs there are no format changes. More useful for 3.1 is the codecs directory which have raw codec compressed byte streams for you to decode and validate. These are shrunk down copies of the data in https://github.com/jkbonfield/htscodecs-corpus/tree/master/dat, although I think that may be a little incomplete now. Basically it's designed to test things like rans, fqz, arith and name tokeniser codecs in isolation, for unit testing purposes. This is mostly just integration testing. The nature of CRAM is that even with the same codec we could produce subtly different encoded byte streams (eg it's up to the implementation how it approximates the probabilities tables in rANS and handles rounding errors when the frequencies don't divide up perfectly, but the data stream contains the necessary lookup tables to decode so that's what matters). So what we need for testing are:
|
|
Some comments regarding CRAM 4.0. This is long, but it may explain why I think CRAM 4 is now something to ignore. (Mainly speaking to you as someone looking at adding it for Noodles). Personally, I think I made an error when I added it to htslib, but the history here is important to understand how we got to where we did. Initially I had two aims to improve on CRAM 3.0:
Fixing the CRAM format included things like replacing ITF8/LTF8 with a true universal variable length encoder that doesn't need to know the expected bit-size (somewhat of an anathema to proper variable length encoding). It also separated signed from unsigned values, so we no longer take 5 bytes to store -1 (or worse for LTF8). It also gained better ways of doing constant values. I also fixed some deficiencies such as how to discover whether the original data had MD/NM (by adding MD:* and NM:* tags with the "*" being auto-generated on decode, so the presence and location are known). Further improvements would be around =/X vs M in cigar, and so on. There are quite a few little improvements to make it work better as a round-trip tool. I added a CRAM v4 idea tracker on hts-specs not just because I was hoping for community ideas, but primarily as a way to date stamp the ideas I had and prevent patent squatters like MPEG from patenting obvious future improvements! Looking at the compression side, I regret not adding in zstd, but it was a bit too new then. Zstd should totally replace Gzip (deflate) and LZMA. Deflate is just hampered by a maximum window size of 32kb, which makes deduplication - the core part of LZ - fail on long reads and similar data types. At high levels zstd approaches lzma for ratio. It's not quite as good, but close, and frankly lzma is almost never worth the CPU cost so I doubt it's used much. Similarly bzip2 should be ditched in favour of libbsc, which betters it on pretty much every aspect. Combined we'd be reducing the number of dependencies while also using more modern high performant libraries. There's even a branch of io_lib with libbsc and zstd. This was demonstrated to MPEG at a conference in their first call-for-proposals on a new genomics format. Something I now strongly regret taking part in! (Ultimately they only took the name tokeniser, but have now regretted going with their own entropy encoder as it's simply way too slow.) The other aspect was improving ratios via custom codecs. The old rANS wins here because it's so much faster as an entropy encoder, and sometimes entropy encoding (or minimal statistical encoding) is the best you can get, so the complex LZ searching methods aren't worth it. (Before CRAM 3.0 we even had zlib being tried with Z_HUFFMAN_ONLY as an option, because it sometimes beat normal zlib while being vastly faster.) The ideas from there eventually ended up in CRAM 3.1. One aspect of that was the notion of data transforms. This is an idea taken out of PNG, and later used by myself for ZTR. Data starts off as raw, but can have a transformation applied to it. Decoding is like peeling the layers off an onion. If it's not raw format yet, you pass it through the transform indicated in the data-stream and repeat, until you get to the original raw data. These could be delta if it's a series of integer values; size squashing - eg an array of 16bit values becoming 8-bit with an escape mechanism for out of bound data; run-length encoding; bit-packing (eg ACGT strings taking up 2bits per value instead of 8); or even bigger transformations such as BWT and MTF. These can be implemented in CRAM as layered encodings, much like we already have BYTE_ARRAY_LEN which contains two sub-encodings for the literal and length fields. RLE would be a direct comparison to that in data layout. I initially implementing PACK and RLE here. CRAM 4 was an experimental melting pot of compression improvements along side the format changes. However it rapidly became apparent that getting any traction on these notions was going to be slow in htsjdk, so I took the decision to attempt to produce a cut down CRAM 4, renamed to CRAM 3.1. The theory was if we only had compression codec changes and we had a high performance separate library to implement them then the code changes needed in other CRAM implementations were very minimal; just updating the list of legal codec numbers and attaching them to callouts to a library API. This isn't much different to how existing implementations may already be handling lzma, bzip2 and deflate formats. (Although I'm aware Java has some native implementations, I was incorrectly assuming the faster C implementations were being utilised mostly.) A consequence of this is the separate onion-layer transform approach had to go, as that was an CRAM encoding change rather than a trivial codec API (ie compress-buffer and decompress-buffer functions). They got consumed into the rANS-Nx16 codec instead, although minus delta which was a mistake as it would greatly help certain tag types such as ONT signals). It wasn't such a bad idea, but ultimately it didn't help at all as there was no willingness to simply link against an external library yet we suffered splitting CRAM updates into two new versions instead of 1. There are still improvements that could be made with CRAM, but I wouldn't make them now I think so ultimately I think CRAM 4.0 is dead in the water. I'm considering removing it from htslib, if only to reduce the potential attack vector for supplying malicious files. If I was to pick this up again, it'd probably be to start again from scratch and build on something like the recent OpenZL. It wouldn't have the same compression ratios as archive mode CRAM, unless we extended it, but I like the notion of a standard structure that off the shelf tooling can decode. It's like BGZF is basically just decodable with zcat, but with openzl it could include all those useful chunking strategies and data transformations. I still have plans for a BGZF2, but based on OpenZL now instead of zstd. I just got sucked into other projects and my time evaporated to work on this, but one day! My rough prototype showed enormous gains for multi-sample VCF and long-read SAM files. It's not up to CRAM performance, but closer to CRAM than BAM for Revio and with a rewrite to use OpenZL it ought to be far closer still. |
|
@jkbonfield Thank you so much for both your pointers on testdata in htsspecs, and the history around CRAM 4. That's a lot to digest, and I think probably I'll reach out via email re: future formats (or finally make it to a GA4GH call). But the short version is that between AI-assisted coding and Project Panama in Java 21/22+, I think it might be much easier to spin up Java wrappers around native libraries and build Java-enabled implementations of future compression formats much faster. |
94ec6dc to
c944b6c
Compare
nh13
left a comment
There was a problem hiding this comment.
This needs some battle-testing; it's too big for a static code review :/
| public ByteBuffer decompress(final ByteBuffer input) { | ||
| // Read ISIZE from the last 4 bytes of the GZIP block to size the output | ||
| final int isizeOffset = input.limit() - 4; | ||
| final int isize = input.duplicate().order(ByteOrder.LITTLE_ENDIAN).position(isizeOffset).getInt(); |
There was a problem hiding this comment.
issue:
ISIZE is mod 2^32 so you use Integer.toUnsignedLong for the 2GB-4GB case, check if isize is zero, but no idea how to handle if the actual size was > 4GB. I think it doesn't matter for CRAM, but this is a public function...
| } | ||
|
|
||
| private void grow(final int minCapacity) { | ||
| int newCapacity = buf.length << 1; |
There was a problem hiding this comment.
nitpick: could overflow if buf.length > 1GB
| ### CRAM 3.1 Write Support | ||
|
|
||
| - Enable CRAM 3.1 writing with all spec codecs: rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE | ||
| - Add configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE) with trial compression for automatic codec selection |
There was a problem hiding this comment.
suggestion:
| - Add configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE) with trial compression for automatic codec selection | |
| - Add configurable compression profiles (FAST, NORMAL, SMALL, ARCHIVE) with trial compression for automatic codec selection |
|
|
||
| > **NOTE: _HTSJDK has only partial support for the latest Variant Call Format Specification. VCFv4.3 can be read but not written, VCFv4.4 can be read in lenient mode only, and there is no support for BCFv2.2._** | ||
|
|
||
| > **NOTE: _HTSJDK now supports both reading and writing CRAM 3.1 files. CRAM 3.1 write support includes all codecs defined in the specification (rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE), configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE), and trial compression for automatic codec selection. Files produced by htsjdk are interoperable with samtools/htslib._** |
There was a problem hiding this comment.
suggestion:
| > **NOTE: _HTSJDK now supports both reading and writing CRAM 3.1 files. CRAM 3.1 write support includes all codecs defined in the specification (rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE), configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE), and trial compression for automatic codec selection. Files produced by htsjdk are interoperable with samtools/htslib._** | |
| > **NOTE: _HTSJDK now supports both reading and writing CRAM 3.1 files. CRAM 3.1 write support includes all codecs defined in the specification (rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE), configurable compression profiles (FAST, NORMAL, SMALL, ARCHIVE), and trial compression for automatic codec selection. Files produced by htsjdk are interoperable with samtools/htslib._** |
| @@ -107,7 +109,7 @@ public void writeAlignment(final SAMRecord alignment) { | |||
| */ | |||
| // TODO: retained for backward compatibility for disq in order to run GATK tests (remove before merging this branch) | |||
There was a problem hiding this comment.
suggestion: can we remove this now?
| int basesRemaining = 0; | ||
| boolean firstLen = true; | ||
| int lastLen = 0; | ||
| int qualOffset = 0; |
| @@ -335,84 +402,150 @@ private static void writeString(final ByteBuffer tokenStreamBuffer, final String | |||
There was a problem hiding this comment.
nitpick:
| tokenStreamBuffer.put(val.getBytes(StandardCharsets.UTF_8)); |
| this.byteOffsetOfContainer = containerByteOffset; | ||
|
|
||
| final ContentDigests hasher = ContentDigests.create(ContentDigests.ALL); | ||
| // htslib does not write content digest tags (BD/SD/B5/S5/B1/S1) into slice headers. |
There was a problem hiding this comment.
suggestion: add to breaking changes in the CHANGELOG?
| import htsjdk.samtools.*; | ||
| import htsjdk.samtools.cram.structure.CRAMCompressionProfile; | ||
| import htsjdk.samtools.cram.structure.CRAMEncodingStrategy; | ||
| import htsjdk.samtools.util.CloserUtil; |
There was a problem hiding this comment.
suggestion:
| import htsjdk.samtools.util.CloserUtil; |
| * @param reader the reader to read from | ||
| * @return the decoded value | ||
| */ | ||
| public static long readUnsignedLTF8(final CRAMByteReader reader) { |
There was a problem hiding this comment.
suggestion: these are super dense, ITF8.java and the rest of this code I think is more readable.
Initial implementation of CRAM 3.1 write support, contributed by Chris Norman. Adds a naive write profile that writes valid CRAM 3.1 files without yet using any of the format's new specialised codecs (FQZComp, Name Tokeniser, RangeCoder, etc.); all data series continue to use CRAM 3.0 codecs (rANS, GZIP). This squashes the four prior commits: - Enable a naive CRAM 3.1 write profile. - Changes based on instrumented trial runs. - CRAM 3.1 write tests and code cleanup. - Temp fix for preSorted=false default. This commit serves as the foundation that the following commit builds on to add the full CRAM 3.1 codec set, profile system, and performance optimisations. Co-authored-by: Tim Fennell <tim@fulcrumgenomics.com>
Builds out CRAM 3.1 write support on top of cnorman's initial naive profile, taking htsjdk to feature parity with samtools/htslib for writing CRAM 3.1 and matching samtools on cross-implementation fidelity across a broad range of real-world data. CRAM 3.1 codecs and supporting infrastructure --------------------------------------------- - Full CRAM 3.1 codec set: rANS Nx16 (Order 0/1, RLE, Stripe, Pack, Cat), FQZComp (quality scores), Range adaptive arithmetic coder, and the Name Tokeniser. - New compression profile system (FAST, NORMAL, SMALL, ARCHIVE), matching htslib's `--output-fmt-option fast/small/archive`. Each profile picks per-DataSeries compressors and (for SMALL/ARCHIVE) trial-compression candidate sets. - TrialCompressor: a wrapper that tries multiple compressors per block and keeps the smallest output. Replaces the ad-hoc tag triple- compression path with a uniform, profile-driven mechanism. - DataSeries content IDs aligned with htslib so CRAM dumps from the two implementations are directly comparable. Codec performance work ---------------------- - rANS codecs reworked to use a `byte[]` API and backwards-write, removing per-byte stream overhead and several layers of indirection. - GzipCodec uses Deflater/Inflater directly instead of going through GZIPInputStream/OutputStream, avoiding gzip framing overhead per block. - Name Tokeniser encoder: hand-written tokenisation replaces regex, per-type flags + STRIPE + duplicate-stream elimination + all-MATCH fast path significantly improve both speed and ratio. - FQZComp / Range coder / rANS encoder hot paths tightened. - Pooled RANSNx16Decode instance in the Name Tokeniser to avoid reallocating per call. - Replaced ByteArrayInputStream/OutputStream with unsynchronized CRAMByteReader/Writer to remove monitor overhead in tight loops. - Cached SAMTag key metadata to eliminate per-record String allocation during CRAM decode. - Fused read-base restoration, CIGAR building, and NM/MD computation into a single pass instead of three. - jlibdeflate compatibility fix on the ByteBuffer path. - Roughly 15% faster encoding overall vs the pre-optimisation state, with read decoding gains in line. Correctness fixes found during cross-implementation validation -------------------------------------------------------------- - TLEN is now computed using htslib's coordinate-extent rule (max(end1,end2) - min(start1,start2) + 1, signed by leftmost position) for CRAM-specific compute, rather than the SAM 5'-to-5' rule. Without this, every read with a supplementary alignment mismatched on TLEN through any CRAM round-trip. - CompressionHeader serialisation now uses a growable ByteArrayOutputStream rather than a fixed 100 KB ByteBuffer for the preservation map and tag encoding map. The TD tag dictionary alone can exceed 100 KB for rich tag sets (PacBio/Ultima flow-space, ONT modified-base tags) under the archive profile, where it would previously throw BufferOverflowException. - Slice flush now uses a dual record/bases threshold matching htslib's `seqs_per_slice` AND `bases_per_slice` (default = readsPerSlice * 500). Without this, archive-profile encoding of long-read data (PacBio HiFi 15 kb, ONT 30 kb+) would buffer ~1+ GB of quality scores per slice and OOM the FQZComp encoder. - Strip NM/MD on encode and regenerate them on decode (matching htslib's default `store_nm=0`/`store_md=0` behaviour). Implemented attached mate pairs to align the in-memory representation with the spec. - Restored CIGAR reconstruction when SEQ is `*` (CF_UNKNOWN_BASES). - Fixed crash when reading containers that contain no slices. - Removed unnecessary content digest tags from CRAM slice headers (htslib doesn't write them either; htsjdk's verification on read was overly strict). Tools ----- - `CramConverter`: a small command-line tool for converting between SAM/BAM/CRAM, primarily for benchmarking and exercising profiles. - `CramStats`: dump per-block compression statistics from a CRAM file for debugging and ratio analysis. Tests and CI ------------ - New hts-specs CRAM 3.0/3.1 compliance tests covering decode, index query, and round-trip behaviour. - FQZComp round-trip tests using the hts-specs quality data files. - CRAI index query correctness tests; codec roundtrip property tests. - Split CRAM31 fidelity tests into per-profile classes for parallel execution. - Reduced memory pressure in the unit tests to eliminate intermittent OOM failures on CI. - Sped up several long-running CRAM tests by caching test data, reducing slice-size matrices, and downsampling fixtures. - Misc: fixed a thread-safety bug in VariantContextTestProvider that was producing non-deterministic test counts. CHANGELOG / README updated for the 5.0.0 release. Co-authored-by: Chris Norman <cnorman@broadinstitute.org>
Snakemake pipeline that round-trips a curated set of public BAM/CRAM
files through htsjdk and samtools and verifies output equivalence
across all four CRAM 3.1 compression profiles (FAST, NORMAL, SMALL,
ARCHIVE) and all four encode/decode combinations. The pipeline was
used to validate the CRAM 3.1 write implementation in the previous
commit against samtools 1.21 on real-world data spanning five
sequencing platforms (Illumina, PacBio HiFi, ONT, Element AVITI,
Ultima), seven assay types (WGS, WGBS, RNA-seq, scRNA-seq, exome,
Hi-C, amplicon), and two organisms.
Pipeline (validation/)
----------------------
- Snakefile: encode original -> CRAM with both htsjdk and samtools,
decode each CRAM with both, then compare each decoded BAM against
the original. 4 profiles * 4 encoder/decoder combinations = 16
comparisons per sample.
- Two sample-sheet formats:
samples.tsv - local file paths (3 columns)
test_datasets.tsv - remote URLs auto-downloaded (9 columns)
Format is auto-detected from header. Reference URLs may be a
comma-separated list, with optional `#contig_name` suffix to rename
a single-sequence FASTA's header on download (e.g. Lambda phage
chrL spike-in for bisulfite samples).
- Intermediate files (BAMs and CRAMs) are marked temp() so they are
cleaned up as soon as their dependents finish; rule priorities push
the DAG to drain depth-first, keeping peak disk bounded.
- Per-rule log directives, retry/escalating-memory ladder for the
Java jobs (4 GB -> 6 GB -> 8 GB -> 10 GB), and pixi.toml for a
reproducible toolchain.
CramComparison
--------------
A general-purpose record-by-record SAM/BAM/CRAM comparison utility
used by the pipeline as the comparator. Tolerates the universal CRAM
roundtrip normalisations:
- CIGAR =/X collapsed to M before comparison (CRAM stores match/
mismatch in read features, so the =/X distinction is lost on
decode in both htsjdk and samtools).
- mapQ ignored for unmapped reads (SAM spec leaves it undefined,
and both implementations normalise it to 0 on decode).
- Auto-generated MD/NM tags stripped when one side is missing them.
- Unsigned B-array type differences tolerated (CRAM stores signed).
Exit code 0 on both match and mismatch; exit 2 only on actual error,
so Snakemake preserves the result file in either case.
Co-authored-by: Claude <noreply@anthropic.com>
|
Tests pass locally, and are still running into OOM issues in CI. I'm going to merge this to dev and address there if it's still an issue once some other PRs land. |
|
I should note for the record that this PR adds a metric ton of unit tests for the CRAM code, and has been battle-tested by transcoding a broad variety of real-world data from BAM to CRAM and back again with each compression profile, with samtools and htsjdk via the included validation pipeline. The validation pipeline uncovered a small number of subtle bugs that have now been fixed! |
Description
Ok, so this is a monster PR. I'm not even sure it should be a single PR, but I wanted to put it up here to get some feedback on how best to proceed. I've created it as a draft for that reason. It builds upon the work by @cmnbroad in #1739 while implementing the necessary additions to write CRAM 3.1 at comparable compression ratios to htslib/samtools.
This was done with the assistance of Claude Code, though it also represents a significant investment of my own personal time, prompting, reviewing, fixing, redirecting.
The PR breaks down roughly as follows:
Many of the additions and changes are strongly inspired by work in htslib!
Currently with the
normalprofile selected, htsjdk emits CRAMs approximately similar in size to htslib. With some 1KG low-pass data I've been testing with the results are actually slightly smaller. However runtime to generate the CRAMs is ~2x slower than htslib/samtools. I think that gap could be shrunk further (it started out around 5-6x slower), I think there will always be a substantial gap due to a) not having easy access to SIMD instructions and b) not having a goodlibdeflatewrapper on the JVM.The read path for CRAM 3.1 predates this PR, but the changes here provide substantial performance improvements. Again, testing on low-pass WGS data I'm seeing a 40-50% speedup.
Assuming this PR/changeset has legs, I will do some broader benchmarking and report on both compression and runtime performance.
@jkbonfield I wouldn't expect you to review this PR, but if you have thoughts on how to validate the results of this work beyond the broad set of CRAM unit tests in the htsjdk repo, I would be very grateful for your input.
Things to think about before submitting: